Policy Improvement for POMDPs Using Normalized Importance Sampling
Author
Abstract
We present a new method for estimating the expected return of a POMDP from experience. The estimator does not assume any knowledge of the POMDP, can estimate the returns for finite state controllers, allows experience to be gathered from arbitrary sequences of policies, and estimates the return for any new policy. We motivate the estimator from function-approximation and importance sampling points of view and derive its bias and variance. Although the estimator is biased, it has low variance and the bias is often irrelevant when the estimator is used for pair-wise comparisons. We conclude by extending the estimator to policies with memory and compare its performance in a greedy search algorithm to the REINFORCE algorithm, showing an order of magnitude reduction in the number of trials required.
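To make the construction concrete, the core of such a normalized importance sampling return estimate for reactive policies can be sketched as follows; the trajectory layout, field names, and `target_policy` interface are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def normalized_is_return(trajectories, target_policy):
    """Normalized importance sampling estimate of the expected return of
    `target_policy` from trajectories gathered under other (behavior) policies.

    Each trajectory is assumed to be a dict with:
      'obs'            : list of observations o_t
      'actions'        : list of actions a_t
      'behavior_probs' : probabilities the behavior policy assigned to a_t given o_t
      'return'         : observed (discounted) return of the trajectory
    `target_policy(o, a)` returns the probability the new policy assigns to a given o.
    """
    weights, returns = [], []
    for traj in trajectories:
        w = 1.0
        for o, a, b in zip(traj['obs'], traj['actions'], traj['behavior_probs']):
            w *= target_policy(o, a) / b        # per-step likelihood ratio
        weights.append(w)
        returns.append(traj['return'])
    weights = np.asarray(weights)
    returns = np.asarray(returns)
    # Normalizing by the sum of weights (rather than by the number of
    # trajectories) introduces a bias but typically lowers the variance.
    return np.sum(weights * returns) / np.sum(weights)
```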
Similar Resources
Importance Sampling Estimates for Policies with Memory
Importance sampling has recently become a popular method for computing off-policy Monte Carlo estimates of returns. It has been known that importance sampling ratios can be computed for POMDPs when the sampled and target policies are both reactive (memoryless). We extend that result to show how they can also be efficiently computed for policies with memory state (finite state controllers) witho...
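A minimal sketch of the quantity involved is below, assuming one particular finite state controller parameterization (initial memory distribution, observation-driven memory transitions, memory-conditioned action probabilities). The importance ratio for a trajectory is this probability under the target controller divided by the same probability under the behavior policy.

```python
import numpy as np

def fsc_action_sequence_prob(obs, actions, init_dist, trans, act_prob):
    """Probability a finite state controller assigns to an action sequence
    given an observation sequence, marginalizing over its hidden memory states
    with a forward recursion.

    init_dist[m]    : P(initial memory state m)
    trans[m, o, m2] : P(next memory state m2 | memory state m, observation o)
    act_prob[m, a]  : P(action a | memory state m)
    (This parameterization is an illustrative assumption.)
    """
    alpha = init_dist.copy()                      # filtering distribution over memory states
    prob = 1.0
    for o, a in zip(obs, actions):
        step = alpha * act_prob[:, a]             # weight each memory state by P(a_t | m_t)
        prob_t = step.sum()                       # P(a_t | actions and observations so far)
        prob *= prob_t
        alpha = (step @ trans[:, o, :]) / prob_t  # condition on a_t, then transition on o_t
    return prob
```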
Monte Carlo Sampling Methods for Approximating Interactive POMDPs
Partially observable Markov decision processes (POMDPs) provide a principled framework for sequential planning in uncertain single agent settings. An extension of POMDPs to multiagent settings, called interactive POMDPs (I-POMDPs), replaces POMDP belief spaces with interactive hierarchical belief systems which represent an agent’s belief about the physical world, about beliefs of other agents, ...
The Cross-Entropy Method for Policy Search in Decentralized POMDPs
Decentralized POMDPs (Dec-POMDPs) are becoming increasingly popular as models for multiagent planning under uncertainty, but solving a Dec-POMDP exactly is known to be an intractable combinatorial optimization problem. In this paper we apply the Cross-Entropy (CE) method, a recently introduced method for combinatorial optimization, to Dec-POMDPs, resulting in a randomized (sampling-based) algor...
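As a rough illustration of the approach (not the paper's specific construction), a generic cross-entropy loop over a real-valued policy parameterization looks like this, where `evaluate` is assumed to return an estimated joint return for a candidate Dec-POMDP policy:

```python
import numpy as np

def cross_entropy_policy_search(evaluate, n_params, iterations=50,
                                samples=100, elite_frac=0.1):
    """Generic cross-entropy search: sample candidate policies from a Gaussian,
    keep the best-scoring ones, and refit the sampling distribution to them."""
    mean, std = np.zeros(n_params), np.ones(n_params)
    n_elite = max(1, int(samples * elite_frac))
    for _ in range(iterations):
        thetas = np.random.randn(samples, n_params) * std + mean
        scores = np.array([evaluate(t) for t in thetas])
        elite = thetas[np.argsort(scores)[-n_elite:]]      # highest estimated returns
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```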
An Empirical Analysis of Off-policy Learning in Discrete MDPs
Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-hor...
A comparative study of counterfactual estimators
We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators dom...
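For reference, with per-example importance weights w_i (the likelihood ratio of the target policy over the logging policy) and logged rewards r_i, the three estimators compared there reduce to the following sketch; the function names are illustrative:

```python
import numpy as np

def empirical_average(rewards):
    """Ignores the policy mismatch entirely: the plain mean of logged rewards."""
    return np.mean(rewards)

def basic_importance_sampling(rewards, weights):
    """Unbiased but often high-variance: mean of weight * reward."""
    return np.mean(np.asarray(weights) * np.asarray(rewards))

def normalized_importance_sampling(rewards, weights):
    """Biased but lower-variance: weighted mean with the weights renormalized."""
    w = np.asarray(weights)
    return np.sum(w * np.asarray(rewards)) / np.sum(w)
```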
Journal:
Volume, Issue:
Pages: -
Publication year: 2001